Training topic classifiers for conversational speech with limited data

نویسندگان

  • Rukmini Iyer
  • Jeff Z. Ma
  • Herbert Gish
  • Owen Kimball
چکیده

In this paper we demonstrate how automatically generated transcriptions can be used to develop an effective topic classification application. Two key contributions of our work are (a) investigating the impact of unsupervised transcriptions on topic classification where the transcription system has been trained with very limited amounts of data, and (b) demonstrating the use of mixture language models that significantly improve topic classification performance.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Getting More Mileage from Web Text Sources for Conversational Speech Language Modeling using Class-Dependent Mixtures

Sources of training data suitable for language modeling of conversational speech are limited. In this paper, we show how training data can be supplemented with text from the web filtered to match the style and/or topic of the target recognition task, but also that it is possible to get bigger performance gains from the data by using class-dependent interpolation of N-grams.

متن کامل

Confidence-Based Techniques for Rapid and Robust Topic Identification of Conversational Telephone Speech

We investigate the impact of automatic speech recognition errors on the accuracy of topic identification in conversational telephone speech. We present a modified TF-IDF featureweighting calculation that provides significant robustness under various recognition error conditions. For our experiments we take conversations from the Fisher corpus to produce 1-best and lattice outputs using one reco...

متن کامل

Using conversational word bursts in spoken term detection

We describe a language independent word burst feature based on the structure of conversational speech that can be used to improve spoken term detection (STD) performance. Word burst refers to a phenomenon in conversational speech in which particular content words tend to occur in close proximity of each other as a byproduct of the topic under discussion. To take advantage of bursts, we describe...

متن کامل

A Boosting Approach to Topic Spotting on Subdialogues

We report the results of a study on topic spotting in conversational speech. Using a machine learning approach, we build classifiers that accept an audio file of conversational human speech as input, and output an estimate of the topic being discussed. Our methodology makes use of a wellknown corpus of transcribed and topic-labeled speech (the Switchboard corpus), and involves an interesting do...

متن کامل

Class-dependent Interpolation for Estimating Language Models from Multiple Text Sources

Sources of training data suitable for language modeling of conversational speech are limited. In this paper, we show how training data can be supplemented with text from the web filtered to match the style and/or topic of the target recognition task, but also that it is possible to get bigger performance gains from the data by using class-dependent interpolation of N-grams.

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2002